Case Study - U.S. 1992-2015 Wildfires


Author: Colby Huang

Data sourced from https://www.kaggle.com/rtatman/188-million-us-wildfires

The dataset used for this case study is a database of about 1.88 million wildfires in the United States from the years 1992 to 2015. This dataset interests me because my family lives in California, where over the last few years there have been many fierce fire seasons and summer skies have often been obscured by smoke. Since my hobby is astrophotography, which demands clear night skies, the frustration of having nights ruined by the smoky haze despite clear weather piqued my interest in these wildfires and whether it had always been like this.

August Complex smoke

Source: firemap.sdsc.edu

The above is a timelapse of smoke from the August Complex, a 2020 wildfire complex that burned over a million acres and remains the largest wildfire in California history. While this dataset doesn't contain data that recent, studying it may still reveal useful trends.

Potential Questions

Exploratory Data Analysis

Let's import our dataset, starting with the Fires table. The dataset is split into 24 CSV files in /data, one for each year from 1992 to 2015 inclusive. We'll load each year's data and concatenate it with the rest. Depending on your system, this may take a short while. We should end up with 1,880,465 rows and 38 columns, minus any columns we drop.
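As a rough illustration of the load-and-concatenate step, here is a minimal sketch using only the Python standard library. The helper name `concat_csv`, the `EXTRA` column, and the sample rows are assumptions made up for this example; the notebook's actual loading code and file names are not shown here.

```python
import csv
import io

def concat_csv(streams, drop=()):
    """Read several CSV streams that share a header into one list of row dicts."""
    rows = []
    for stream in streams:
        for row in csv.DictReader(stream):
            for col in drop:
                row.pop(col, None)  # ignore columns a file does not have
            rows.append(row)
    return rows

# Two tiny in-memory stand-ins for per-year files (made-up values).
y1992 = io.StringIO("FIRE_YEAR,STATE,FIRE_SIZE,EXTRA\n1992,CA,0.1,x\n1992,OR,12.0,y\n")
y1993 = io.StringIO("FIRE_YEAR,STATE,FIRE_SIZE,EXTRA\n1993,CA,3.5,z\n")

fires = concat_csv([y1992, y1993], drop=["EXTRA"])
print(len(fires), "rows with columns", sorted(fires[0]))
```

With files on disk, each stream could instead come from `open(path, newline="")`. Note that `csv.DictReader` keeps every value as a string, so numeric columns need an explicit conversion before any arithmetic.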

Let's examine the dataset. A description of each of these columns can be found at https://www.kaggle.com/rtatman/188-million-us-wildfires. The DISCOVERY_DATE and CONT_DATE columns have been reformatted as date strings (yyyy-mm-dd).

Who is this data from?

Let's take a look at the sources of these fire reports.

The column SOURCE_SYSTEM_TYPE indicates if the record was drawn from a federal, nonfederal, or interagency database.

It looks like non-federal records make up most of the dataset.

Let's find the National Wildfire Coordinating Group (NWCG) agencies that prepared the most fire reports. The column we'll want to look at is NWCG_REPORTING_AGENCY.

As was somewhat expected from the previous plot, state/county & local reports dominate the data. Let's look only at federal agencies.

Filtering out ST/C&L (State/County and Local), IA (Interagency), and TRIBE (Tribal organizations), we can look at the federal agencies that contributed the most reports.
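The filter-then-count step can be sketched as follows. The function name `top_federal_agencies` and the sample rows are hypothetical; only the column name and the three excluded codes come from the text above.

```python
from collections import Counter

# Agency codes treated as non-federal in this analysis.
NON_FEDERAL = {"ST/C&L", "IA", "TRIBE"}

def top_federal_agencies(rows, n=3):
    """Count reports per agency, skipping non-federal codes, and return the top n."""
    counts = Counter(
        row["NWCG_REPORTING_AGENCY"]
        for row in rows
        if row["NWCG_REPORTING_AGENCY"] not in NON_FEDERAL
    )
    return counts.most_common(n)

# Made-up sample; the real table has one row per fire report.
codes = ["ST/C&L", "FS", "FS", "BIA", "IA", "BLM", "FS", "BIA", "TRIBE", "NPS"]
sample = [{"NWCG_REPORTING_AGENCY": code} for code in codes]

for agency, count in top_federal_agencies(sample):
    print(agency, count)
```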

The U.S. Forest Service (FS), the Bureau of Indian Affairs (BIA), and the Bureau of Land Management (BLM) dominate the federal fire reports.

Changes in number of reported incidents over time

Let's start by checking how the number of reported incidents changes over time from 1992-2015. The FIRE_YEAR column describes the year of the incident.

The years with the highest fire report counts were 2006, 2000, 2007, 2011, and 1999. Here's how the number of reports changed over the years:

The year with the most reported fire incidents was 2006. The chart shows several peaks in the number of reported incidents, and they appear somewhat regularly spaced. This might indicate a recurring multi-year cycle in wildfire severity.

To further investigate how wildfire severity changes, let's look at how fire sizes change each year. The data provides a column called FIRE_SIZE_CLASS, which classifies fires by the size of the final perimeter in acres. Each fire is labeled with a letter from A to G, with A being the smallest at 0.25 acres or less and G being the largest at 5000+ acres. We can start with this column and count how many fires of each size class were reported each year.
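To make the lettering concrete, here is a small sketch that maps an acreage to a size class. The bounds for A and G come from the text. The bounds for C through F are the commonly cited NWCG thresholds and are an assumption here, so they should be checked against the dataset documentation. The dataset already provides FIRE_SIZE_CLASS, so this function is purely illustrative.

```python
def size_class(acres):
    """Return the size class letter (A-G) for a final perimeter in acres."""
    if acres <= 0.25:
        return "A"
    # Assumed exclusive upper bounds for classes B through F.
    for letter, upper in zip("BCDEF", (10, 100, 300, 1000, 5000)):
        if acres < upper:
            return letter
    return "G"  # 5000 acres or more

for acres in (0.1, 3, 50, 250, 800, 2500, 12000):
    print(acres, size_class(acres))
```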

We can see that fires are most often in the "B" size class, which is between 0.25 and 10 acres. Large fires in higher size classes make up a small proportion of yearly fires.

Let's see our yearly number of fire incidents again, this time with the percentage of fires that are G class (5000 acres or higher) overlaid.
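The percentage being overlaid can be computed along these lines. This is a sketch with a hypothetical helper name and made-up rows, not the notebook's original code.

```python
from collections import Counter

def g_class_share_by_year(rows):
    """Return {year: (report_count, percent_of_reports_in_class_G)}."""
    totals, g_counts = Counter(), Counter()
    for row in rows:
        year = int(row["FIRE_YEAR"])
        totals[year] += 1
        if row["FIRE_SIZE_CLASS"] == "G":
            g_counts[year] += 1
    return {y: (totals[y], 100.0 * g_counts[y] / totals[y]) for y in sorted(totals)}

# Made-up sample: 4 reports in 2005 and 8 in 2006, one G class fire in each.
sample = (
    [{"FIRE_YEAR": "2005", "FIRE_SIZE_CLASS": c} for c in "ABBG"]
    + [{"FIRE_YEAR": "2006", "FIRE_SIZE_CLASS": c} for c in "ABBBBCDG"]
)
shares = g_class_share_by_year(sample)
for year, (count, pct_g) in shares.items():
    print(year, count, f"{pct_g:.1f}% G class")
```

In this toy sample the year with more reports has the lower G class percentage, which shows why the two series are worth plotting together rather than assuming they move in step.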

Years with a higher proportion of very large G class fires somewhat cluster around peaks in yearly reported incidents. However, there isn't a strong direct correlation between the number of reports and the proportion of G class fires.

Total Area of Final Perimeters by Year

We'll find the total area of the final perimeters for each size class of fire, then create a stacked area plot to look at both the total area burned and the makeup of burned areas by size class.
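The aggregation behind a stacked area plot like this can be sketched as a year-by-class table of summed acres. The column names FIRE_SIZE and STATE are assumed here (they are not named in the text), and the helper and sample rows are hypothetical. An optional state filter is included for narrowing the table to one state.

```python
from collections import defaultdict

def area_by_year_and_class(rows, state=None):
    """Sum final-perimeter acres into {year: {size_class: acres}}."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        if state is not None and row["STATE"] != state:
            continue
        table[int(row["FIRE_YEAR"])][row["FIRE_SIZE_CLASS"]] += float(row["FIRE_SIZE"])
    return {year: dict(classes) for year, classes in sorted(table.items())}

# Made-up rows: (year, state, size class, acres).
raw = [
    ("2007", "CA", "G", "6000"),
    ("2007", "CA", "B", "2.5"),
    ("2007", "TX", "B", "1.5"),
    ("2008", "CA", "G", "9000"),
]
keys = ("FIRE_YEAR", "STATE", "FIRE_SIZE_CLASS", "FIRE_SIZE")
sample = [dict(zip(keys, values)) for values in raw]

national = area_by_year_and_class(sample)
california = area_by_year_and_class(sample, state="CA")
print(national)
print(california)
```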

We can see in this stacked area plot that G class fires (the largest class) dominate the national totals in terms of total area burned, even though we previously saw they made up a very small proportion of our reported incidents. We don't see a strong pattern in the total yearly area, though years with a high number of incidents in our previous bar plot also tend to have a high total area burned. The peaks in area seem to grow somewhat over time, but without data beyond 2015 we can't tell whether that growth continued.

However, if we limit the data to more localized areas, we may see a different picture:

If we limit the data to a single fire-prone state, a cycle in the yearly total fire area looks more plausible. California's fire incidents show a peak about every 3-5 years, the largest in this data being 2008, with about 1.4 million total acres within the final perimeters. The magnitude of these peaks also seems to increase from 1996 to 2008. It makes sense that we would see stronger peaks in more localized data, since local effects like fuel buildup would have a stronger influence there.

Going back to the nationwide data, plotting total fire area versus year:

The total area burned each year shows a significant upward trend over time, though the correlation is not strong.
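To separate the direction of a trend from its strength, we can fit a least-squares line and compute the Pearson correlation. The sketch below uses made-up yearly totals, not the real data, and the helper name `linear_trend` is hypothetical. It does not compute a p-value, so it does not test significance on its own.

```python
import math

def linear_trend(xs, ys):
    """Return the least-squares slope, intercept, and Pearson r for paired values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

years = [1992, 1993, 1994, 1995, 1996]
acres = [2.0, 1.0, 4.0, 3.0, 5.0]  # made-up totals, in millions of acres

slope, intercept, r = linear_trend(years, acres)
print(f"slope = {slope:.2f} per year, r = {r:.2f}, r^2 = {r * r:.2f}")
```

A positive slope indicates an upward trend, while r squared indicates how much of the year-to-year variation the line explains.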

Causes of wildfires

In this section we'll look at the reported causes of these wildfires and look for patterns in them.

The column STAT_CAUSE_DESCR describes the reported cause of each wildfire report. We can start by finding what causes of wildfire are the most common. The top 5 causes of wildfire can be found by grouping the data by cause description and counting the number of fire reports for each cause:

The most common fire cause in our dataset is "debris burning". "Arson" and "lightning" are also in the top 5 most common causes of wildfires. Fires with miscellaneous causes and unknown/missing causes are also common.

We may also be interested in how the causes of wildfires change throughout the year. The dataset provides a DISCOVERY_DOY column that gives the day of the year on which the fire was discovered, as a number from 1 to 365 (366 in leap years). We'll use this to create a histogram to analyze when in the year each cause of fire is the most common.

Before grouping by cause, we should look at when in the year fires are the most common overall.
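One way to read day-of-year values is to convert them to calendar months before counting. The sketch below uses hypothetical helper names and made-up rows. It uses FIRE_YEAR alongside DISCOVERY_DOY because the same day number can land in a different month in a leap year.

```python
from collections import Counter
from datetime import date, timedelta

def discovery_month(year, day_of_year):
    """Convert a 1-based day of year to a month number (1-12)."""
    return (date(year, 1, 1) + timedelta(days=day_of_year - 1)).month

def monthly_counts(rows):
    """Count fire reports per discovery month."""
    return Counter(
        discovery_month(int(row["FIRE_YEAR"]), int(row["DISCOVERY_DOY"]))
        for row in rows
    )

# Made-up rows: day 91 is April 1 in 2011 but March 31 in leap year 2012.
raw = [("2011", "91"), ("2012", "91"), ("2015", "185"), ("2012", "200")]
sample = [{"FIRE_YEAR": y, "DISCOVERY_DOY": d} for y, d in raw]

counts = monthly_counts(sample)
for month in sorted(counts):
    print(month, counts[month])
```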

Fire reports appear to be bimodal, with peaks around March-April and July-August.

Next, let's create density plots and histograms for each fire cause to find the distribution of day of discovery for each individual cause.

Our first plot, a density plot, shows that much of the bimodal shape of the discovery days can be explained by the combination of "Debris burning" and "Lightning" reports. However, it is hard to compare the individual causes and their distributions in this plot. The second plot has a separate histogram for each fire cause and makes the individual distributions easier to compare.

Many of the causes peak about a quarter to a third of the way into the year, which corresponds to spring (roughly April to early May). It is interesting that arson-caused fires peak so strongly around March-May; it may be worth investigating why arson would be more common in these months. Others, like camping and equipment use, rise steadily to a peak in mid-summer and then steadily decline. These are probably related to a combination of summer heat and more people vacationing.

It's also notable that fireworks-caused fires fall almost entirely in one bin in the middle of the year, likely corresponding to July 4th celebrations. Lightning-caused fires have a distinct bell-shaped distribution with a very slight skew and a peak around 200 days into the year, in mid-to-late July.

Next we'll look at how cause of fire relates to location.

We'll create vertical boxplots of latitude to see if different causes of wildfires usually appear at different latitudes. Since seasonal temperature swings and weather can be influenced by latitude, this may help us understand how these are related to our fires. We'll also find the mean latitude for each cause to order the fire causes in a more visually appealing way in the box plots.
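The ordering step can be sketched as a per-cause summary of mean latitude and interquartile range, excluding Alaska and Hawaii as this analysis does. The column names LATITUDE and STATE are assumed, and the helper name and sample latitudes are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, quantiles

EXCLUDED_STATES = {"AK", "HI"}

def latitude_summary(rows):
    """Return [(cause, mean_latitude, iqr)] sorted from south to north by mean."""
    by_cause = defaultdict(list)
    for row in rows:
        if row["STATE"] not in EXCLUDED_STATES:
            by_cause[row["STAT_CAUSE_DESCR"]].append(float(row["LATITUDE"]))
    summary = []
    for cause, lats in by_cause.items():
        q1, _, q3 = quantiles(lats, n=4)  # needs at least two values per cause
        summary.append((cause, mean(lats), q3 - q1))
    return sorted(summary, key=lambda item: item[1])

# Made-up rows: (cause, state, latitude).
raw = (
    [("Fireworks", "SD", lat) for lat in (44, 45, 46, 47)]
    + [("Railroad", "GA", lat) for lat in (31, 32, 33, 34)]
    + [("Lightning", "CO", lat) for lat in (39, 40, 41, 42)]
    + [("Lightning", "AK", 64)]  # dropped by the state filter
)
keys = ("STAT_CAUSE_DESCR", "STATE", "LATITUDE")
sample = [dict(zip(keys, values)) for values in raw]

summary = latitude_summary(sample)
for cause, mean_lat, iqr in summary:
    print(f"{cause}: mean {mean_lat:.1f}, IQR {iqr:.1f}")
```

`statistics.quantiles` uses its default "exclusive" method here, so its quartiles may differ slightly from those a plotting library draws on a boxplot.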

We removed data from Hawaii and Alaska to reduce outlying points. It looks like fireworks-caused fires often occur farther north, and rail-caused fires often occur farther south. It may also be worth investigating the fire causes with a smaller interquartile range (IQR), like arson and debris burning, since a low IQR may indicate that these causes are more localized.

Finally, we'll find the total area burned for each of these causes, including AK and HI once again. We will first create a table of the fire causes with the most area, then create a treemap for an easy-to-interpret visual comparison of these areas. Additionally, we'll group the data by size class to analyze how different causes result in fires of different sizes.
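The table described above can be sketched as a per-cause total with the share of that area coming from G class fires. The FIRE_SIZE column name is assumed, and the helper name and sample rows are made up for illustration.

```python
from collections import defaultdict

def area_by_cause(rows):
    """Return [(cause, total_acres, percent_of_acres_from_class_G)], largest first."""
    total = defaultdict(float)
    g_area = defaultdict(float)
    for row in rows:
        cause, acres = row["STAT_CAUSE_DESCR"], float(row["FIRE_SIZE"])
        total[cause] += acres
        if row["FIRE_SIZE_CLASS"] == "G":
            g_area[cause] += acres
    table = [
        (cause, acres, 100.0 * g_area[cause] / acres if acres else 0.0)
        for cause, acres in total.items()
    ]
    return sorted(table, key=lambda item: item[1], reverse=True)

# Made-up rows: (cause, size class, acres).
raw = [
    ("Arson", "B", "5"),
    ("Arson", "C", "15"),
    ("Lightning", "G", "8000"),
    ("Lightning", "F", "2000"),
]
keys = ("STAT_CAUSE_DESCR", "FIRE_SIZE_CLASS", "FIRE_SIZE")
sample = [dict(zip(keys, values)) for values in raw]

result = area_by_cause(sample)
for cause, acres, pct_g in result:
    print(f"{cause}: {acres:.0f} acres, {pct_g:.0f}% from G class")
```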

Lightning is by far the largest contributor to the total area of fires. For most of these causes, G class fires account for the largest fraction of the area burned, but they especially dominate lightning-caused fires, making up 86% of the total area burned. It makes sense that lightning would tend to produce large fires rather than many small ones. Lightning often strikes in remote areas where detection may be delayed and access is difficult for firefighters. In addition, fires started by multiple lightning strikes in such areas may merge as they grow.

Summary